Analysis of U.S. Incomes by Occupation and Gender

Notebook by Harish Kesava Rao

Use of this dataset should cite the Bureau of Labor Statistics as per their copyright information: The Bureau of Labor Statistics (BLS) is a Federal government agency and everything that we publish, both in hard copy and electronically, is in the public domain, except for previously copyrighted photographs and illustrations. You are free to use our public domain material without specific permission, although we do ask that you cite the Bureau of Labor Statistics as the source.

What kind of questions are we aiming to answer through the data analysis?

Is there an inequality of income in the labor force?
Is it more pronounced in certain sectors than others?
Can the dataset help us answer the questions?
a. Is the dataset in a useable format?
b. How much of pre-analysis preparation is required before we start analyzing the dataset?
What other inferences can we derive out of the dataset?

Analyzing the dataset

Median weekly earnings of full-time wage and salary workers by detailed occupation and sex.
Occupation: Job title as given from BLS. Industry summaries are given in ALL CAPS.
All_workers: Number of workers male and female, in thousands.
All_weekly: Median weekly income including male and female workers, in USD.
M_workers: Number of male workers, in thousands.
M_weekly: Median weekly income for male workers, in USD.
F_workers: Number of female workers, in thousands.
F_weekly: Median weekly income for female workers, in USD.



In [194]:

    
import pandas as pd
from pandas import DataFrame, Series

We assign more meaningful and self explanatory column headers. The names list can be assigned to the names parameter when reading in the csv file.



In [195]:

    
names = ['Occupations','All_workers', 'Weekly_Income_Overall', 
                             'Male_Workers (in 1000s)', 'Median_weekly_income (Male)',
                             'Female_Workers (in 1000s)', 'Median_weekly_income (Female)']



In [196]:

    
#mac
income = pd.read_csv('/Users/Harish/Documents/HK_Work/Python/Python-for-Data-Analysis/Kaggle-US-Incomes/inc_occ_gender.csv'
                   ,names=names, header=None)

#win
#income = pd.read_csv(r'C:\Users\\Documents\Personal\Python-for-Data-Analysis\Kaggle-US-Incomes\inc_occ_gender.csv'
                # ,names=names, header=None)

To manipulate and perform operations on numerical columns, we will need to handle null/empty/non-numerical values in the numerical columns. In our dataset, this means everything except the Occupations column.



In [197]:

    
income.head()









    Out[197]:






  
    
      
      Occupations
      All_workers
      Weekly_Income_Overall
      Male_Workers (in 1000s)
      Median_weekly_income (Male)
      Female_Workers (in 1000s)
      Median_weekly_income (Female)
    
  
  
    
      0
      Occupation
      All_workers
      All_weekly
      M_workers
      M_weekly
      F_workers
      F_weekly
    
    
      1
      ALL OCCUPATIONS
      109080
      809
      60746
      895
      48334
      726
    
    
      2
      MANAGEMENT
      12480
      1351
      7332
      1486
      5147
      1139
    
    
      3
      Chief executives
      1046
      2041
      763
      2251
      283
      1836
    
    
      4
      General and operations managers
      823
      1260
      621
      1347
      202
      1002

The original header that was part of the csv can be dropped



In [198]:

    
income.drop(0, inplace=True)



In [229]:

    
income['income_index'] = income.index

The following looks more readable



In [232]:

    
income.head()









    Out[232]:






  
    
      
      Occupations
      All_workers
      Weekly_Income_Overall
      Male_Workers (in 1000s)
      Median_weekly_income (Male)
      Female_Workers (in 1000s)
      Median_weekly_income (Female)
      income_index
    
  
  
    
      1
      ALL OCCUPATIONS
      109080
      809
      60746
      895
      48334
      726
      1
    
    
      2
      MANAGEMENT
      12480
      1351
      7332
      1486
      5147
      1139
      2
    
    
      3
      Chief executives
      1046
      2041
      763
      2251
      283
      1836
      3
    
    
      4
      General and operations managers
      823
      1260
      621
      1347
      202
      1002
      4
    
    
      5
      Legislators
      8
      0
      5
      0
      4
      0
      5

Now, we will slice the dataset to understand more about the data



In [200]:

    
#selecting just the occupations and worker count into a new dataframe
job_type_by_worker_count = income[['Occupations','All_workers']]



In [201]:

    
job_type_by_worker_count.sort_values(by='All_workers', ascending =False)









    Out[201]:






  
    
      
      Occupations
      All_workers
    
  
  
    
      204
      Miscellaneous health technologists and technic...
      99
    
    
      186
      Respiratory therapists
      99
    
    
      279
      SALES
      9725
    
    
      446
      Miscellaneous assemblers and fabricators
      950
    
    
      40
      Cost estimators
      95
    
    
      326
      Reservation and transportation ticket agents a...
      95
    
    
      7
      Marketing and sales managers
      948
    
    
      246
      Food servers, nonrestaurant
      93
    
    
      109
      Chemists and materials scientists
      93
    
    
      346
      Office clerks, general
      929
    
    
      284
      Parts salespersons
      92
    
    
      130
      Miscellaneous community and social service spe...
      92
    
    
      141
      Postsecondary teachers
      917
    
    
      216
      Phlebotomists
      91
    
    
      390
      Highway maintenance workers
      91
    
    
      110
      Environmental scientists and geoscientists
      90
    
    
      70
      Database administrators
      90
    
    
      180
      Podiatrists
      9
    
    
      265
      Ushers, lobby attendants, and ticket takers
      9
    
    
      437
      Signal and track switch repairers
      9
    
    
      391
      Rail-track laying and maintenance equipment op...
      9
    
    
      513
      Adhesive bonding machine operators and tenders
      9
    
    
      91
      Marine engineers and naval architects
      9
    
    
      449
      Food and tobacco roasting, baking, and drying ...
      9
    
    
      31
      Emergency management directors
      9
    
    
      484
      Textile cutting machine setters, operators, an...
      9
    
    
      228
      Parking enforcement workers
      9
    
    
      380
      Reinforcing iron and rebar workers
      9
    
    
      366
      Carpet, floor, and tile installers and finishers
      89
    
    
      255
      Maids and housekeeping cleaners
      876
    
    
      ...
      ...
      ...
    
    
      1
      ALL OCCUPATIONS
      109080
    
    
      164
      Editors
      108
    
    
      187
      Speech-language pathologists
      108
    
    
      43
      Training and development specialists
      107
    
    
      247
      Dining room and cafeteria attendants and barte...
      107
    
    
      56
      Insurance underwriters
      106
    
    
      382
      Sheet metal workers
      106
    
    
      319
      Interviewers, except eligibility and loan
      105
    
    
      144
      Secondary school teachers
      1048
    
    
      3
      Chief executives
      1046
    
    
      337
      Stock clerks and order fillers
      1027
    
    
      327
      Information and record clerks, all other
      100
    
    
      262
      Nonfarm animal caretakers
      100
    
    
      369
      Paving, surfacing, and tamping equipment opera...
      10
    
    
      536
      Sailors and marine oilers
      10
    
    
      512
      Semiconductor processors
      1
    
    
      343
      Desktop publishers
      1
    
    
      433
      Commercial divers
      1
    
    
      106
      Life scientists, all other
      1
    
    
      461
      Milling and planing machine setters, operators...
      1
    
    
      466
      Multiple machine tool setters, operators, and ...
      1
    
    
      78
      Miscellaneous mathematical science occupations
      1
    
    
      481
      Shoe machine operators and tenders
      1
    
    
      493
      Model makers and patternmakers, wood
      0
    
    
      377
      Paperhangers
      0
    
    
      358
      Hunters and trappers
      0
    
    
      115
      Sociologists
      0
    
    
      113
      Survey researchers
      0
    
    
      556
      Mine shuttle car operators
      0
    
    
      487
      Extruding and forming machine setters, operato...
      0
    
  

558 rows × 2 columns

The sorting did not happen by the All_workers column. Let's see why that is the case.



In [202]:

    
income.dtypes









    Out[202]:





Occupations                      object
All_workers                      object
Weekly_Income_Overall            object
Male_Workers (in 1000s)          object
Median_weekly_income (Male)      object
Female_Workers (in 1000s)        object
Median_weekly_income (Female)    object
dtype: object

The columns are objects. We will need to convert them to primitive types, such as integer, string etc.



In [203]:

    
job_type_by_worker_count2 = income.apply(pd.to_numeric, errors = 'ignore')



In [204]:

    
job_type_by_worker_count2.dtypes









    Out[204]:





Occupations                      object
All_workers                       int64
Weekly_Income_Overall            object
Male_Workers (in 1000s)           int64
Median_weekly_income (Male)      object
Female_Workers (in 1000s)         int64
Median_weekly_income (Female)    object
dtype: object

The conversion happened for All_workers, Male_Workers (in 1000s), Female_Workers (in 1000s), but not for the other columns. We will see why.



In [205]:

    
job_type_by_worker_count2.head(n=15)









    Out[205]:






  
    
      
      Occupations
      All_workers
      Weekly_Income_Overall
      Male_Workers (in 1000s)
      Median_weekly_income (Male)
      Female_Workers (in 1000s)
      Median_weekly_income (Female)
    
  
  
    
      1
      ALL OCCUPATIONS
      109080
      809
      60746
      895
      48334
      726
    
    
      2
      MANAGEMENT
      12480
      1351
      7332
      1486
      5147
      1139
    
    
      3
      Chief executives
      1046
      2041
      763
      2251
      283
      1836
    
    
      4
      General and operations managers
      823
      1260
      621
      1347
      202
      1002
    
    
      5
      Legislators
      8
      Na
      5
      Na
      4
      Na
    
    
      6
      Advertising and promotions managers
      55
      1050
      29
      Na
      26
      Na
    
    
      7
      Marketing and sales managers
      948
      1462
      570
      1603
      378
      1258
    
    
      8
      Public relations and fundraising managers
      59
      1557
      24
      Na
      35
      Na
    
    
      9
      Administrative services managers
      170
      1191
      96
      1451
      73
      981
    
    
      10
      Computer and information systems managers
      636
      1728
      466
      1817
      169
      1563
    
    
      11
      Financial managers
      1124
      1408
      551
      1732
      573
      1130
    
    
      12
      Compensation and benefits managers
      23
      Na
      7
      Na
      16
      Na
    
    
      13
      Human resources managers
      254
      1365
      68
      1495
      186
      1274
    
    
      14
      Training and development managers
      37
      Na
      17
      Na
      20
      Na
    
    
      15
      Industrial production managers
      267
      1485
      221
      1528
      45
      Na

The columns for which the type conversion did not happend contain 'Na' values in them. These look like NULL or NA values, but actually, they are strings with the value 'Na'. We can work around these values by replacing them with 0.



In [206]:

    
income.replace('Na',0,inplace=True)



In [207]:

    
income.head(n=15)









    Out[207]:






  
    
      
      Occupations
      All_workers
      Weekly_Income_Overall
      Male_Workers (in 1000s)
      Median_weekly_income (Male)
      Female_Workers (in 1000s)
      Median_weekly_income (Female)
    
  
  
    
      1
      ALL OCCUPATIONS
      109080
      809
      60746
      895
      48334
      726
    
    
      2
      MANAGEMENT
      12480
      1351
      7332
      1486
      5147
      1139
    
    
      3
      Chief executives
      1046
      2041
      763
      2251
      283
      1836
    
    
      4
      General and operations managers
      823
      1260
      621
      1347
      202
      1002
    
    
      5
      Legislators
      8
      0
      5
      0
      4
      0
    
    
      6
      Advertising and promotions managers
      55
      1050
      29
      0
      26
      0
    
    
      7
      Marketing and sales managers
      948
      1462
      570
      1603
      378
      1258
    
    
      8
      Public relations and fundraising managers
      59
      1557
      24
      0
      35
      0
    
    
      9
      Administrative services managers
      170
      1191
      96
      1451
      73
      981
    
    
      10
      Computer and information systems managers
      636
      1728
      466
      1817
      169
      1563
    
    
      11
      Financial managers
      1124
      1408
      551
      1732
      573
      1130
    
    
      12
      Compensation and benefits managers
      23
      0
      7
      0
      16
      0
    
    
      13
      Human resources managers
      254
      1365
      68
      1495
      186
      1274
    
    
      14
      Training and development managers
      37
      0
      17
      0
      20
      0
    
    
      15
      Industrial production managers
      267
      1485
      221
      1528
      45
      0

This looks better, without the Na string values. Now, we will convert all the columns, except Occupations to int. Occupations is a string column and we can leave it as an object type.



In [208]:

    
income_clean = income.apply(pd.to_numeric, errors = 'ignore')



In [209]:

    
income_clean.dtypes









    Out[209]:





Occupations                      object
All_workers                       int64
Weekly_Income_Overall             int64
Male_Workers (in 1000s)           int64
Median_weekly_income (Male)       int64
Female_Workers (in 1000s)         int64
Median_weekly_income (Female)     int64
dtype: object

We will now slice the data and analyze the count of workers in each occupation.



In [210]:

    
top10_worker_count = income_clean.sort_values(by='All_workers', axis=0, 
                                                           ascending=False)[['Occupations','All_workers',
                                                                                    'Male_Workers (in 1000s)',
                                                                                    'Female_Workers (in 1000s)']]



In [211]:

    
top10_worker_count.drop(1, axis=0, inplace=True)

The first row shows the total number of workers. We will drop the row since it is not useful for our analysis and will skew the overall pattern.



In [212]:

    
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline



In [228]:

    
top_worker_count_barplot = sb.factorplot(x='Occupations',
           y='All_workers', data=top10_worker_count.head(n=5), hue='All_workers', 
                                     size=7, aspect=2.3,kind='bar')
top_worker_count_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female", 
                             title = "Top worker counts and the occupations")
sb.set(font_scale=2.3)

We have a plot showing top employee count for 5 occupations. We will append gender wise numbers on the same plot. For this purpose, we use pandas' melt function. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html



In [ ]:

    
top10_worker_count.head()



In [ ]:

    
top10_worker_count = pd.melt(top10_worker_count, id_vars=["Occupations"], 
                 value_vars=['All_workers','Male_Workers (in 1000s)','Female_Workers (in 1000s)'])



In [ ]:

    
top10_mf_count = income_clean.sort_values(by='All_workers', axis=0, 
                                                           ascending=False)[['Occupations',
                                                                                    'Male_Workers (in 1000s)',
                                                                                    'Female_Workers (in 1000s)']]



In [ ]:

    
top10_mf_count.drop(1, axis=0, inplace=True)



In [ ]:

    
top10_male_female = pd.melt(top10_mf_count, id_vars=["Occupations"], 
                 value_vars=['Male_Workers (in 1000s)','Female_Workers (in 1000s)'])



In [ ]:

    
top10_worker_count.replace('NaN',0,inplace=True)
top10_male_female.replace('NaN',0,inplace=True)



In [ ]:

    
sorted_worker_count_n = top10_worker_count.sort_values(by='value', axis=0, 
                                                           ascending=False)[['Occupations','variable','value']]



In [ ]:

    
sorted_male_female_count_n = top10_male_female.sort_values(by='value', axis=0, 
                                                           ascending=False)[['Occupations','variable','value']]



In [ ]:

    
sorted_worker_count_n.head(n=5)



In [214]:

    
sorted_male_female_count_n.head(n=5)









    Out[214]:






  
    
      
      Occupations
      variable
      value
    
  
  
    
      557
      OFFICE
      Female_Workers (in 1000s)
      9933
    
    
      1
      MANAGEMENT
      Male_Workers (in 1000s)
      7332
    
    
      4
      TRANSPORTATION
      Male_Workers (in 1000s)
      5998
    
    
      8
      CONSTRUCTION
      Male_Workers (in 1000s)
      5586
    
    
      3
      PRODUCTION
      Male_Workers (in 1000s)
      5548



In [222]:

    
sorted_worker_count_n_barplot = sb.factorplot(x='Occupations',
           y='value', data=sorted_worker_count_n.head(n=10), hue='variable', 
                                         palette="bright",
                                     size=10, aspect=5,kind='bar')
sorted_worker_count_n_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female", 
                             title = "Top worker counts and the occupations")
sb.set(font_scale=4.0)

In the above plot, we see that there are some professions involving only male employees - such as pales, production, transportation, construction etc. Similarly, there are occuapations that involve only female employees - such as healthcare professionals, education etc. This is just a sample of 20 occupations. We can analyze more, as seen below.



In [ ]:

    
sorted_male_female_count_n_barplot = sb.factorplot(x='Occupations',
           y='value', data=sorted_male_female_count_n.head(n=10), hue='variable', 
                                         palette="bright",
                                     size=12, aspect=5,kind='bar')
sorted_male_female_count_n_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female", 
                             title = "Top male, female employee counts and the occupations")
sb.set(font_scale=4)

We can observe a similar trend in the above plot as well, where transportation, construction, production, maintenance and groundskeeping involve male employees and education and healthcare involves female employees. We can also see that business, office, sale etc. have both female and male employees.

Now, let us sort the Occupations by weekly median income, irrespective of gender of employees



In [ ]:

    
top10_income_count = income_clean.sort_values(by='Weekly_Income_Overall', axis=0, 
                                                           ascending=False)[['Occupations',
                                                                                    'Weekly_Income_Overall']]

low10_income_count = income_clean.sort_values(by='Weekly_Income_Overall', axis=0, 
                                                           ascending=True)[['Occupations',
                                                                                    'Weekly_Income_Overall']]



In [227]:

    
top10_income_count_barplot = sb.factorplot(x='Occupations',
           y='Weekly_Income_Overall', data=top10_income_count.head(n=5), hue='Occupations', 
                                         palette="bright",
                                     size=12, aspect=5, kind='bar')
top10_income_count_barplot.set(xlabel = "Occupations", ylabel = "Weekly Income", 
                             title = "Top Weekly Incomes and the respective occupations")
sb.set(font_scale=2)
#size=12, aspect=5,



In [160]:

    
low10_income_count = low10_income_count[(low10_income_count != 0).all(1)]



In [162]:

    
# lowest10_income_count_barplot = sb.factorplot(x='Weekly_Income_Overall',
#             y='Occupations', data=low10_income_count.head(n=10), hue='Occupations',palette="bright",
#                                              size=12, aspect=5, kind='bar')
# lowest10_income_count_barplot.set(xlabel='Occupations', ylabel = 'Weekly Income',
#                                  title = 'Lowest Weekly Incomes and the respective occupations')

	Occupations	All_workers	Weekly_Income_Overall	Male_Workers (in 1000s)	Median_weekly_income (Male)	Female_Workers (in 1000s)	Median_weekly_income (Female)
0	Occupation	All_workers	All_weekly	M_workers	M_weekly	F_workers	F_weekly
1	ALL OCCUPATIONS	109080	809	60746	895	48334	726
2	MANAGEMENT	12480	1351	7332	1486	5147	1139
3	Chief executives	1046	2041	763	2251	283	1836
4	General and operations managers	823	1260	621	1347	202	1002

	Occupations	All_workers
204	Miscellaneous health technologists and technic...	99
186	Respiratory therapists	99
279	SALES	9725
446	Miscellaneous assemblers and fabricators	950
40	Cost estimators	95
326	Reservation and transportation ticket agents a...	95
7	Marketing and sales managers	948
246	Food servers, nonrestaurant	93
109	Chemists and materials scientists	93
346	Office clerks, general	929
284	Parts salespersons	92
130	Miscellaneous community and social service spe...	92
141	Postsecondary teachers	917
216	Phlebotomists	91
390	Highway maintenance workers	91
110	Environmental scientists and geoscientists	90
70	Database administrators	90
180	Podiatrists	9
265	Ushers, lobby attendants, and ticket takers	9
437	Signal and track switch repairers	9
391	Rail-track laying and maintenance equipment op...	9
513	Adhesive bonding machine operators and tenders	9
91	Marine engineers and naval architects	9
449	Food and tobacco roasting, baking, and drying ...	9
31	Emergency management directors	9
484	Textile cutting machine setters, operators, an...	9
228	Parking enforcement workers	9
380	Reinforcing iron and rebar workers	9
366	Carpet, floor, and tile installers and finishers	89
255	Maids and housekeeping cleaners	876
...	...	...
1	ALL OCCUPATIONS	109080
164	Editors	108
187	Speech-language pathologists	108
43	Training and development specialists	107
247	Dining room and cafeteria attendants and barte...	107
56	Insurance underwriters	106
382	Sheet metal workers	106
319	Interviewers, except eligibility and loan	105
144	Secondary school teachers	1048
3	Chief executives	1046
337	Stock clerks and order fillers	1027
327	Information and record clerks, all other	100
262	Nonfarm animal caretakers	100
369	Paving, surfacing, and tamping equipment opera...	10
536	Sailors and marine oilers	10
512	Semiconductor processors	1
343	Desktop publishers	1
433	Commercial divers	1
106	Life scientists, all other	1
461	Milling and planing machine setters, operators...	1
466	Multiple machine tool setters, operators, and ...	1
78	Miscellaneous mathematical science occupations	1
481	Shoe machine operators and tenders	1
493	Model makers and patternmakers, wood	0
377	Paperhangers	0
358	Hunters and trappers	0
115	Sociologists	0
113	Survey researchers	0
556	Mine shuttle car operators	0
487	Extruding and forming machine setters, operato...	0

	Occupations	variable	value
557	OFFICE	Female_Workers (in 1000s)	9933
1	MANAGEMENT	Male_Workers (in 1000s)	7332
4	TRANSPORTATION	Male_Workers (in 1000s)	5998
8	CONSTRUCTION	Male_Workers (in 1000s)	5586
3	PRODUCTION	Male_Workers (in 1000s)	5548